feat(orchestrator): unified Worker lifecycle service replacing docker-proxy #451

Draft
Jing-ze wants to merge 12 commits into agentscope-ai:main from Jing-ze:worktree-feat-docker-proxy-sae

@Jing-ze Jing-ze commented Mar 27, 2026

Summary

This PR replaces the docker-proxy (a simple Docker API security proxy) with hiclaw-orchestrator — a unified Worker lifecycle service that abstracts away the underlying compute platform. The orchestrator exposes a single REST API that Manager and Workers interact with, regardless of whether workers run as local Docker containers or cloud SAE applications.

Architecture

Manager  ─── REST API ───→  Orchestrator  ───→  DockerBackend (local)
                                          ───→  SAEBackend (cloud)
                                          ───→  (future: K8s, ACS, ...)

Workers  ─── POST /credentials/sts ───→  Orchestrator  ───→  STS (scoped tokens)
         ─── POST /workers/{name}/ready ──→  Orchestrator  (readiness reporting)

What changed

Phase 1 — Restructure: Renamed docker-proxy/ to orchestrator/, restructured into proxy/, backend/, api/ packages. Defined WorkerBackend and GatewayBackend interfaces. Implemented DockerBackend. Added unified /workers/* REST API while preserving Docker API passthrough for exec/logs.

Phase 2 — Cloud backends: Implemented SAEBackend (Alibaba Cloud SAE) and APIGBackend (AI Gateway consumer management) using Go SDKs, replacing aliyun-api.py and aliyun-sae.sh. Added two-tier auth (static manager key + per-worker API keys). Added centralized STS token service — workers no longer need OIDC credentials; the orchestrator issues scoped OSS tokens via POST /credentials/sts. Worker API keys persisted to OSS for recovery across restarts.
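The two-tier auth described in Phase 2 can be pictured with a minimal sketch. It assumes bearer-style tokens and an in-memory key map; the type and function names (`authenticator`, `authenticate`, the `role` constants) are hypothetical and are not taken from the orchestrator source.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// role is a hypothetical classification of the caller.
type role int

const (
	roleNone role = iota
	roleManager
	roleWorker
)

// authenticator holds the static manager key and the per-worker API keys.
// The in-memory map is an assumption; the PR persists worker keys to OSS.
type authenticator struct {
	managerKey string
	workerKeys map[string]string // worker name -> API key
}

// authenticate returns the caller's role and, for workers, the worker name.
func (a *authenticator) authenticate(token string) (role, string) {
	if token == "" {
		return roleNone, ""
	}
	// Constant-time comparison avoids leaking key contents via timing.
	if subtle.ConstantTimeCompare([]byte(token), []byte(a.managerKey)) == 1 {
		return roleManager, ""
	}
	for name, key := range a.workerKeys {
		if subtle.ConstantTimeCompare([]byte(token), []byte(key)) == 1 {
			return roleWorker, name
		}
	}
	return roleNone, ""
}

func main() {
	a := &authenticator{
		managerKey: "manager-secret",
		workerKeys: map[string]string{"w1": "w1-key"},
	}
	r, name := a.authenticate("w1-key")
	fmt.Println(r == roleWorker, name)
}
```

With this shape, a handler such as `POST /credentials/sts` could scope the issued token to the worker name returned by `authenticate`, rather than trusting a name supplied in the request body.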

Phase 3 — Shell simplification: Rewrote container-api.sh as a thin orchestrator API client (~170 lines, down from ~730). Simplified gateway-api.sh cloud path. Simplified create-worker.sh Step 9 into a single unified orchestrator call. Deleted aliyun-api.py (527 lines) and aliyun-sae.sh (81 lines). Removed Python SDK dependencies from Dockerfile.aliyun.

Readiness detection: SAE Create() polls until the application reaches RUNNING state. Workers self-report readiness via POST /workers/{name}/ready after agent initialization. GET /workers/{name} merges backend status with readiness: running + reported ready = ready. Unified worker_backend_wait_ready replaces Docker exec-based health polling.
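The merge rule for `GET /workers/{name}` is small enough to state as code. This is a sketch of the rule as described above, not the handler itself; the function name `mergeStatus` and the lowercase status strings other than `running` and `ready` are assumptions.

```go
package main

import "fmt"

// mergeStatus combines the backend-reported status with the worker's
// self-reported readiness: only "running" plus a ready report yields "ready".
// Every other backend status is passed through unchanged.
func mergeStatus(backendStatus string, reportedReady bool) string {
	if backendStatus == "running" && reportedReady {
		return "ready"
	}
	return backendStatus
}

func main() {
	fmt.Println(mergeStatus("running", true))  // ready
	fmt.Println(mergeStatus("running", false)) // running
	fmt.Println(mergeStatus("stopped", true))  // stopped
}
```

A stale ready report therefore cannot mask a worker that the backend no longer considers running.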

Backend abstraction: WorkerBackend interface includes NeedsCredentialInjection() capability method. All backend-specific logic (credential injection, runtime env vars) is encapsulated inside each backend's Create(). Handler and main layers are fully backend-agnostic — no runtime string checks, no backend name matching. Adding a new backend (K8s, ACS) only requires implementing the interface.
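To make the abstraction concrete, here is a reduced sketch of what such an interface and config-driven selection might look like. The method set is trimmed and the signatures are assumptions; only `Name()`, `Available()`, `NeedsCredentialInjection()`, `Create()` and the `HICLAW_RUNTIME=aliyun` injection are named in this PR. Whether DockerBackend returns false from `NeedsCredentialInjection()` is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerBackend is a hypothetical reduction of the interface in this PR.
type WorkerBackend interface {
	Name() string
	Available() bool
	NeedsCredentialInjection() bool
	Create(name string, env map[string]string) error
}

// dockerBackend is available when the Docker socket is reachable (assumed).
type dockerBackend struct{ socketPresent bool }

func (d dockerBackend) Name() string                             { return "docker" }
func (d dockerBackend) Available() bool                          { return d.socketPresent }
func (d dockerBackend) NeedsCredentialInjection() bool           { return false }
func (d dockerBackend) Create(string, map[string]string) error   { return nil }

// saeBackend is enabled by the presence of a worker image in its own config.
type saeBackend struct{ workerImage string }

func (s saeBackend) Name() string                   { return "sae" }
func (s saeBackend) Available() bool                { return s.workerImage != "" }
func (s saeBackend) NeedsCredentialInjection() bool { return true }

// Create keeps backend-specific env injection inside the backend.
func (s saeBackend) Create(name string, env map[string]string) error {
	env["HICLAW_RUNTIME"] = "aliyun"
	return nil
}

// selectBackend returns the first available backend; callers never
// inspect backend names.
func selectBackend(backends []WorkerBackend) (WorkerBackend, error) {
	for _, b := range backends {
		if b.Available() {
			return b, nil
		}
	}
	return nil, errors.New("no worker backend available")
}

func main() {
	b, err := selectBackend([]WorkerBackend{
		dockerBackend{socketPresent: false},
		saeBackend{workerImage: "example/worker:latest"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	env := map[string]string{}
	_ = b.Create("w1", env)
	fmt.Println(b.Name(), env["HICLAW_RUNTIME"])
}
```

A new backend would slot into the list passed to `selectBackend` without any change to the calling code.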

Key design decisions

  • Workers have no OIDC capability — orchestrator is the sole credential issuer
  • HICLAW_ORCHESTRATOR_URL is now the single endpoint variable: it supersedes HICLAW_CONTAINER_API and the earlier, separate use of HICLAW_ORCHESTRATOR_URL
  • Backend selection is config-driven (presence of HICLAW_SAE_WORKER_IMAGE enables SAE), not runtime-string-driven
  • STS calls use raw HTTP (no STS SDK) to minimize dependencies
  • OSS V1 signing for key persistence avoids pulling in the full OSS SDK
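For context on the last decision, OSS V1 header signing is an HMAC-SHA1 over a newline-joined string, which is why it is practical without the SDK. The sketch below follows the publicly documented V1 scheme; the function name `signOSSV1`, its parameter layout, and the example bucket, object path and credentials are hypothetical, and this is not the code in keys.go.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// signOSSV1 computes the V1 signature:
//   base64(HMAC-SHA1(secret, VERB \n Content-MD5 \n Content-Type \n Date \n
//                    CanonicalizedOSSHeaders + CanonicalizedResource))
// canonHeaders must already be lowercased, sorted, and "key:value\n" joined
// (for example "x-oss-security-token:<token>\n" when using STS credentials).
func signOSSV1(secret, verb, contentMD5, contentType, date, canonHeaders, canonResource string) string {
	stringToSign := verb + "\n" +
		contentMD5 + "\n" +
		contentType + "\n" +
		date + "\n" +
		canonHeaders + canonResource
	mac := hmac.New(sha1.New, []byte(secret))
	mac.Write([]byte(stringToSign))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	sig := signOSSV1(
		"example-secret", // placeholder, not a real credential
		"PUT", "", "application/json",
		"Mon, 30 Mar 2026 05:44:00 GMT",
		"", "/example-bucket/keys/w1.json",
	)
	// The request would then carry: Authorization: OSS <AccessKeyId>:<sig>
	fmt.Println("Authorization: OSS example-key-id:" + sig)
}
```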

Test plan

  • cd orchestrator && go test ./... — all packages pass
  • Verify grep -r '"aliyun"' orchestrator/ only appears in sae.go (backend internal)
  • Verify grep -r 'IsAliyunRuntime' orchestrator/ returns 0 results
  • Verify grep -r 'b.Name()' orchestrator/api/ returns 0 results
  • make build-orchestrator — Docker image builds
  • Cloud deployment testing with SAE workers

🤖 Generated with Claude Code

github-actions bot commented Mar 27, 2026

❌ Integration Tests Failed

Commit: a054fbe
Workflow run: #538

Test Results
No test output captured.
Debug Log (tail)
No debug logs available.

@Jing-ze Jing-ze marked this pull request as draft March 27, 2026 02:27
@Jing-ze Jing-ze force-pushed the worktree-feat-docker-proxy-sae branch 5 times, most recently from d9c3198 to 7139003 Compare March 27, 2026 05:42
Jing-ze commented Mar 27, 2026

CI failure is expected — the test-integration.yml on main still references build-docker-proxy and lacks the ORCHESTRATOR_IMAGE env var. Since PR workflows run from the base branch, this is a chicken-and-egg issue that resolves once merged.

Local make test passes correctly (all 14 integration tests).

@Jing-ze Jing-ze force-pushed the worktree-feat-docker-proxy-sae branch 2 times, most recently from b4d969d to 325c57d Compare March 27, 2026 08:35
Jing-ze and others added 11 commits March 30, 2026 10:45
…le service

Rename docker-proxy/ to orchestrator/ and restructure into a multi-package
Go service that exposes both a unified Worker lifecycle REST API and the
existing Docker API passthrough.

- Add WorkerBackend/GatewayBackend interfaces for pluggable backends
- Implement DockerBackend (Create/Delete/Start/Stop/Status/List via socket)
- Add /workers/* REST API with proper HTTP status mapping (409/404/503)
- Add /gateway/* API stubs (501, Phase 2 will implement APIG backend)
- Preserve Docker API passthrough with SecurityValidator for backward compat
- Add Backend Registry with auto-detection (Docker first, SAE in Phase 2)
- Update Makefile, CI workflows, install scripts with new names
- Comprehensive test coverage: backend, registry, handler, security

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rvice

Phase 2 of the orchestrator refactoring. Transforms the service from a
Docker-only proxy into a full cloud-capable control plane.

- SAE Backend: manage worker lifecycle via Alibaba Cloud SAE API (Go SDK v4)
- APIG Backend: manage AI Gateway consumers (Go SDK v6)
- Auth middleware: two-tier auth with static manager key + per-worker API keys
- STS Token Service: centralized credential issuance with per-worker OSS policy
- OSS key persistence: worker API keys stored in OSS for recovery across restarts
- Worker shell rewrite: oss-credentials.sh now uses orchestrator-mediated STS refresh
- Shared httputil package: consolidated writeJSON/writeError across packages

Workers have no OIDC capability — orchestrator is the sole credential issuer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3: replace direct Docker API calls and Python/Shell SAE wrappers
with thin orchestrator REST API client.

- Rewrite container-api.sh: worker_backend_* now call orchestrator /workers/* API
- Simplify gateway-api.sh: cloud path calls orchestrator /gateway/* API
- Simplify create-worker.sh Step 9: unified orchestrator call, no Docker/SAE split
- Delete aliyun-sae.sh and aliyun-api.py (replaced by orchestrator Go backends)
- Remove Python SDK dependencies from Dockerfile.aliyun

Net deletion: ~1100 lines of shell/Python code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Unify HICLAW_CONTAINER_API → HICLAW_ORCHESTRATOR_URL (single env var)
2. Remove HICLAW_RUNTIME from create-worker.sh (orchestrator decides)
3. Make image optional in worker create API (backend provides default)
4. Add Timestamp to STS AssumeRoleWithOIDC call
5. SAEBackend.Create() auto-injects HICLAW_RUNTIME=aliyun into worker env
6. oss-credentials.sh: support dual path (RRSA direct + orchestrator mediated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- SAEBackend.Create() polls DescribeApplicationStatus until RUNNING (max 120s)
- New POST /workers/{name}/ready endpoint for worker self-reporting
- GET /workers/{name} merges readiness: running + reported ready = "ready"
- Worker entrypoints (openclaw + copaw) report ready to orchestrator in background
- New worker_backend_wait_ready() in container-api.sh for unified readiness polling
- create-worker.sh Step 9 uses unified wait instead of Docker exec-based polling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate all backend-specific logic from handler and main layers:

- Add NeedsCredentialInjection() to WorkerBackend interface
- Move credential injection (API key, orchestrator URL, HICLAW_RUNTIME)
  into SAEBackend.Create() — handler no longer checks b.Name()
- Replace cfg.Runtime == "aliyun" checks with config-driven backend
  registration (buildBackends function)
- Delete IsAliyunRuntime() global function
- Delete Config.Runtime field
- Backend Available() now checks own config, not global env var

"aliyun" string now only exists inside sae.go (backend internal).
Handler and main layers are fully backend-agnostic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each backend now declares its deployment mode ("local" or "cloud") via
the DeploymentMode() interface method. The API response includes a new
deployment_mode field, eliminating the backend-name-to-mode translation
in create-worker.sh (5 lines → 1 line).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g after rebase

Upstream refactor(network) replaced ExtraHosts with Docker network aliases
on the manager container. Remove leftover ExtraHosts injection in
create-worker.sh and duplicate hiclaw-net setup in install scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ensureImage() to auto-pull missing images before container create
- Handle 409 Conflict by deleting existing container and retrying once
- Add ExposedPorts/PortBindings support for CoPaw console port mapping
  with port conflict retry (up to 10 attempts)
- Pass complete env vars (FS credentials, orchestrator URL) when
  recreating workers in lifecycle-worker.sh and start-manager-agent.sh
- Pass HICLAW_WORKER_IMAGE and HICLAW_COPAW_WORKER_IMAGE to orchestrator
  container in install scripts so it knows which images to use
- Extract console_host_port from orchestrator response in create-worker.sh
  and enable-worker-console.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… persist docs, delete bug

- Worker/CoPaw readiness reporters now heartbeat every 60s after initial
  ready, so orchestrator restarts self-heal without persistence
- Add comment documenting persist-outside-lock trade-off in keys.go
- Fix _detect_worker_backend call in lifecycle-worker.sh action_delete
  (function was removed in refactor, replaced with container_api_available)
- Add backward-compat env var fallback for HICLAW_INSTALL_DOCKER_PROXY_IMAGE
- Update stale comment in copaw-worker-entrypoint.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jing-ze Jing-ze force-pushed the worktree-feat-docker-proxy-sae branch from b52bd41 to 90095d7 Compare March 30, 2026 05:44
…or rename)

Merge origin/main into feature branch, combining the new manager-copaw
build targets from main with the docker-proxy → orchestrator rename
from this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>